SLIP

SLIP Technology Browser Exercise II

November 19, 2001

Obtaining Informational Transparency with Selective Attention

Dr. Paul S. Prueitt
President, OntologyStream Inc.
November 15, 2001

SLIP Technology Browser Exercise II

{Eventname, d_port}

 

November 19, 2001

 

One needs two WinZip files, sSLIP and tSLIP.  Exercise I used the files vSLIP and wSLIP.  These files are available from beadmaster@ontologystream.com.

 

Review:

 

As in Exercise I, an analytic conjecture was developed that linked defender ports through a non-specific relationship. A RealSecure summary intrusion event database was used: 14,475 records from April 15, 2001.  The RealSecure columns that are used are:

 

{ record, ename, protocol, s_port, d_port, s_addname, d_addname, epriority }

 

s_addname is the IP address of the source and d_addname is the IP address of the defender. 

 

Let us now review the formal concept of an analytic conjecture. 

 

 

 

Figure 1: The simplest form of an analytic conjecture

 

 

Formally we have:

( a1 , b ) +  ( a2 , b )  →   < a1 , r,  a2 >

 

where r is the non-specific relationship. 

 

The “b” values are from one column in the intrusion event log and the “a” values are from a second column in the intrusion event log. 

 

We call “b” the “first name” and “a” the “second name”.  The set { a } defines the set of atoms that are categorized. 
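As a sketch only, the pairing expressed by the formula above can be written in a few lines of Python; the record layout and the function name are illustrative, not part of the SLIP tools:

```python
from collections import defaultdict
from itertools import combinations

def conjecture_pairs(records):
    """Link the a-values (second names) that share a b-value (first
    name), yielding the unordered pairs joined by the non-specific
    relationship r."""
    by_b = defaultdict(set)
    for a, b in records:
        by_b[b].add(a)
    pairs = set()
    for a_values in by_b.values():
        for a1, a2 in combinations(sorted(a_values), 2):
            pairs.add((a1, a2))
    return pairs

# Ports 600 and 608 become linked because the same (hypothetical)
# source value touched both of them.
log = [(600, "10.0.0.1"), (608, "10.0.0.1"), (22, "10.0.0.2")]
print(conjecture_pairs(log))  # {(600, 608)}
```

The set of pairs returned plays the role of Paired.txt in the exercises below.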

 

Incident event description

 

Incident event descriptions can be constructed automatically using emergent computing and a reordering of the values of a derived Report.  The purpose of Exercise II is to explore the incident-level event description as a graph (see Figure 2).

 

The set { b } provides the means to define the incident-level events that result from the SLIPstream emergent computing technique.

 

We saw in Exercise I that ordering by the b values produces a global incident event map.  This map is often incomplete, but it can be used to begin developing a model of each of a number of currently occurring global incident events.  We have also seen that other columns can be used in a similar fashion.  Taken together, there may be an automated means to present small topic maps with specific relationships, which domain experts (working within their security environment) can then use to profile the global events of interest. 

 

Part of the reason for Exercise II is to help in the design of a graphical visualization of the many possible emergent patterns from a single SLIP analytic conjecture.  In Exercise I we developed an incident event map for category B1.  B1 was selected from a small, widely distributed cluster at the top node, and then the residue (the complement of B1 in A1) was developed.  This residue was then re-clustered to again manifest the three major clusters that one saw in A1 and that were not part of B1 (see Figures 1, 2, and 5 in Exercise I).

 

 

Figure 2: A graphic depicting the construction and display of the event map

 

In Exercise I one might assume that the category B1 and the event map in Figure 2 constitute a “complete incident” event type.  An event type is something that we can profile because the events in the event record recur with sufficient similarity to be recognized as instances of the event type. 

 

Event maps may in fact be major components of complete incident event types – but not the complete event type.   This issue will need some study.  We need to look at the small clusters at the bottom of a standard construction of a SLIP framework, where all of the major clusters are removed on the first pass and the re-clustering then shows us the remnants of what is left.  The computational machinery to enable a domain expert to do this is being developed. 

 

In Exercise II (this exercise set) we look at a specific atom and find the minimal cluster that contains as many informational relationships as possible.  This technique is a type of min-max problem, but it is not statistical.  It is a categorical min-max problem that has been worked out in general terms. 

 

The Example:

 

Let the first column, (b), be the RealSecure source IP designation (s_addname) and the second column, (a), be the defensive port (d_port).   There are 602 atoms in the top-level category A1 of the associated SLIP Framework.  There are 49 unique ename values in the event log.  These are the b values that can be used in the development of event maps. 

 

There are 47,780 paired d_port values, each part of a pair being a port value.  The pairs are defined through the analytic conjecture graph (Figure 1).

 

Pre-exercise:  Start SLIP.exe in a folder containing a folder named “data”.  You need only two text files to start with.

 

1)       Paired.txt is the file containing the 47,780 pairs of port values.

2)       Datawh.txt (Data Warehouse) is the file containing the 14,475 RealSecure summary event records. 

 

These two files are memory mapped and then searched using new algorithms invented for this purpose.  Paired.txt is searched several hundred thousand times to produce the clustering.  Datawh.txt is searched up to a few hundred times in order to produce a report from a cluster of atoms. 
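A minimal illustration of this kind of memory-mapped search, in Python rather than the browser's own implementation (the file contents and the function are hypothetical stand-ins):

```python
import mmap
import os
import tempfile

def count_occurrences(path, token):
    """Scan a memory-mapped file for a byte token without loading the
    file into Python objects.  Repeated scans stay cheap because the
    operating system keeps the pages cached."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            count = 0
            pos = mm.find(token)
            while pos != -1:
                count += 1
                pos = mm.find(token, pos + 1)
            return count

# Demo with a throwaway file standing in for Paired.txt.
tmp = tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False)
tmp.write("600\t608\n600\t580\n164\t185\n")
tmp.close()
print(count_occurrences(tmp.name, b"600"))  # 2
os.unlink(tmp.name)
```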

 


Focus on the Report

 

The Report has all of those intrusion event log records that have a value equal to one of the atoms in the category.  When the Report is ordered by first name, the structure of incident-level events is revealed. 
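The retrieval and ordering just described can be sketched as follows; the record fields mirror the RealSecure columns listed earlier, but the function and sample data are illustrative:

```python
def build_report(datawh, atoms):
    """Pull every event record whose d_port is one of the category's
    atoms, then order by the first name (here the event name) so that
    incident-level structure lines up."""
    hits = [rec for rec in datawh if rec["d_port"] in atoms]
    return sorted(hits, key=lambda rec: rec["ename"])

# A toy stand-in for Datawh.txt.
datawh = [
    {"ename": "Port_Scan",     "d_port": 600, "s_addname": "10.0.0.1"},
    {"ename": "HTTP_Get",      "d_port": 80,  "s_addname": "10.0.0.2"},
    {"ename": "Port_600_Dest", "d_port": 600, "s_addname": "10.0.0.1"},
]
report = build_report(datawh, {600})
print([r["ename"] for r in report])  # ['Port_600_Dest', 'Port_Scan']
```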

 

What is new in this set of exercises?

 

Clearly an important question is whether our Report retrieval is the right retrieval.  The order in which the records are displayed may reveal what the non-specific relationship means.   If this is true, then we have a new way of automatically developing graph structures that indicate the global characteristics of “events” that occur:

 

1)       at random times and places

2)       not exactly in a fixed set of steps taken in order

3)       with variations and mutations of the pattern over time

 

Before we address the question about the value of the Report directly, let’s look at the typical way that a SLIP Framework might be developed. 

 

 Perhaps the most obvious way to develop categories is to take the large clusters, put each one into a category, and then put the remainder of the atoms into the residue category. 

 

            

Figure 3 (a, b, c): A common way to develop the SLIP framework using large clusters

 

In Figure 3 we show a series of steps where one starts with the top category and then develops the B layer of categories. 

 


Exercise 1, Part One:

 

The steps to repeat this process are as follows.

 

S.1: Unpack either vSLIP or wSLIP.  Remove all content of the Data folder except for the two files:

 

Paired.txt , Datawh.txt

 

A FoxPro program develops Paired.txt from two columns of an event log.  Datawh.txt is our concept of a data warehouse, but in this case each Datawh.txt is just the flat table from a log file.  Using a new concept, we memory map both files.

 

Within a short period, we will have completed a second SLIP Browser, called the SLIP Warehouse Browser.  This will start with a simple tab-delimited ASCII file from an event log and assist the user in developing the Analytic Conjecture from a large number of possibilities viewed as simple graphs.  One result of this process will be the two files, Paired.txt and Datawh.txt.  

 

<Technical Note>

The SLIP Technology Browser and the SLIP Warehouse Browser share a common look and feel.   Neither will have a dependency on a standard relational database management system.  What is being demonstrated by several innovative data mining companies (InMentia and TimesTen) is that a Reference Information Base (RIB) is possible.

 

1)      A RIB is memory mapped for very high-speed sorts and retrievals

2)      A RIB does not support insertions or deletions except by rewriting the entire data structure (which takes time)

3)       Most important, perhaps, is that a specific RIB can be developed to minimally support ONLY those data transformations that are essential to the tasks at hand

 

I use the term “RIB” as a generic term for many of these in-memory databases that are designed as data aggregation platforms.  The two browsers will demonstrate that these three properties allow types of human-computer interaction that are not otherwise possible.

<End Technical Note>

 

S.2: Starting with only these two files in place, double-click the SLIP.exe icon. 

 

S.3: When the program completes its start-up (this takes less than a second), locate the command line and type “extract”. Extract will parse Paired.txt and develop a set of atoms.  These are simply the unique occurrences of either the first part of the pair or the second part of the pair.  In both examples to follow, the pairs are pairs of d_ports (defensive ports).  In this first example, the pairs are specified by a non-specific relationship (Figure 1) with attacker IP addresses. 
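The extract step can be sketched in a few lines; the input lines and function name are illustrative stand-ins for the actual Paired.txt parsing:

```python
def extract_atoms(paired_lines):
    """Return the unique values occurring on either side of the pairs;
    these are the atoms that the clustering works on."""
    atoms = set()
    for line in paired_lines:
        left, right = line.split()
        atoms.update((left, right))
    return sorted(atoms)

# Lines standing in for a tiny Paired.txt.
print(extract_atoms(["600 608", "600 580", "164 185"]))
# ['164', '185', '580', '600', '608']
```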

 

S.4: By clicking on the A1 node that appears, one will see the random distribution of the atoms that have been extracted.  One also sees the number of atoms and other metadata in the Topic Properties Window. 

 

S.5:  Cluster quite a long way out, say by typing “c 500” in the command line.  One will see the KiloIters (number of thousands of iterations) change, as well as the distribution.  When the iterations stop (the hourglass will go away) one can start the iterations again by typing “c”, “c 100”, or “c n”, where n is any positive integer.  You will see limiting distributions that are “similar” to that seen in Figure 3a. 

 

S.6: If your computer has a fast processor, you may wish to see the limiting behavior after many millions of iterations.  You will see that some of the isolated dots become stationary and that the large clusters continue to move around.  Meta-stable transitions between small mini-patterns are likely in the limiting process.  But this phenomenon is not likely to be interesting from the point of view of understanding incident events, at least not now. 

 

Exercise 1, Part Two

 

Again, take the data structure in sSLIP.zip.  The number of atoms extracted during the set-up is 693.  You can check this by deleting all of the data except the two files, Paired.txt and Datawh.txt, and following the steps in Exercise 1, Part One.  Or you can just look in the Topic Properties window. 

 

The analytic conjecture involves source IP addresses (as the b values) and d_port as the a values.  The atoms are d_ports. 

 

If we cluster this set of atoms we find a number of disconnected small clusters and three large clusters. 

 

However, port 600 is always gathered with ports

 

{ 608, 580, 576, 544, 428, 427, 400, 185, 164 }

 

We would like to draw an event map for all things related to port 600.   In this data set, we can easily tell that a link must exist between the Port_Scan event and the Port_600_Dest event.  

 

The primary question is how to get all or most of the informational linkage that arises from the data.

 

We show the location of the port 600 atom in Figure 4a.  You can use the original data in sSLIP.zip to see the SLIP Framework developed in Figure 4 (a – e).

 

By capturing the Port 600 atom at the top level, we should be able to find the prime that this atom sits in.

 


In clustering A1, we tried several times to get the Port 600 atom along with all others that have a link relationship with it.  This process could be automated with a command like “find prime with atom ‘atomValue’ ”.  One can locate 600 in the Members window and then type in the degree to find the location of that atom. 

 

 

Figure 4 (a – e): The development of an analysis specific to Port 600. 

 

Figure 4a shows the 693 atoms clustered with 400,000 iterations.  We found the atom named 600 (Port 600) at degree 81.  Typing “81” in the command line creates a red line pointing to the atom.  We use the command “68, 85 -> B1” to create a category B1 and put the 62 atoms between degree 68 and degree 85 into this category. 
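The effect of the degree-range capture can be pictured as a simple filter over angular positions; the positions and function below are hypothetical, not the browser's internals:

```python
def select_by_degree(positions, lo, hi):
    """Mimic the effect of a command like "68, 85 -> B1": gather the
    atoms whose angular position in degrees lies in [lo, hi]."""
    return sorted(a for a, deg in positions.items() if lo <= deg <= hi)

# Hypothetical positions of a few atoms on the circle, in degrees.
positions = {"600": 81.0, "608": 70.5, "22": 140.0, "580": 84.9}
print(select_by_degree(positions, 68, 85))  # ['580', '600', '608']
```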

 

We re-scatter/gather, locate the atoms in a cluster, and put the 31 atoms so identified into category C1.  Category C1 is very interesting, since one would expect it to be prime.  By “prime” we mean that there is a formal relationship that can be used to prove that the atoms will eventually move to the same location. 

 

The category A1 is not prime.  C1 DOES turn out to be prime; however, the distribution is initially widely separated.  This is seen in Figure 4e.  It can also be seen that I have captured the two clusters separately into categories D1 and D2.  The residue has only 2 atoms. 

 

Exercise 1 Part Three

 

We can see that port atom 600 is in the cluster D2, along with 29 other port atoms.

 

 

Figure 5:  A prime cluster that has port 600 as a member

 

More on this small data structure is discussed with the aid of the three appendices. 

 

In Figure 5, we see eight data points that form the start of a graph which is potentially fully connected by specific relationships, such as order in time.  The IP addresses in the category are “represented” here only by their first occurrence.  These data points are derived in the following way:

 

P.1: D2 is identified as a cluster that is separated from D1, while still having a weak link that will eventually pull all of the 30 atoms together (C1 is prime).  However, the stratified SLIP theory suggests that a prime can be split under certain circumstances identified by formal theorems (see SLIP Data Structures and Programs, dated October 24-25).  The early iterations of C1 clearly show an unusual situation.  One can unzip a second copy of sSLIP.zip, select the C1 node, type “random”, and then cluster 30.  The fast algorithm will iterate 30,000 times, each time selecting a pair and seeing if the pair is in the Paired.txt file (this file being memory mapped as a RIB; see the Technical Note), which makes each iteration very fast. 
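One plausible reading of the scatter/gather iteration just described can be sketched as follows.  The pull fraction, the random placement, and the neglect of wrap-around at 360 degrees are illustrative simplifications, not the SLIP tool's actual scheme:

```python
import random

def cluster(atoms, linked_pairs, iterations=30000, pull=0.05, seed=0):
    """Scatter the atoms at random angular positions (in degrees), then
    repeatedly sample a pair of atoms; when the sampled pair appears in
    the linked set (the role Paired.txt plays), pull the two atoms a
    small fraction of the way toward each other."""
    rng = random.Random(seed)
    pos = {a: rng.uniform(0.0, 360.0) for a in atoms}
    for _ in range(iterations):
        a1, a2 = rng.sample(atoms, 2)
        if (a1, a2) in linked_pairs or (a2, a1) in linked_pairs:
            d = pos[a2] - pos[a1]
            pos[a1] += d * pull
            pos[a2] -= d * pull
    return pos

# Three mutually linked port atoms converge; the unlinked atom stays put.
links = {("600", "608"), ("600", "580"), ("608", "580")}
pos = cluster(["600", "608", "580", "22"], links)
spread = (max(pos[p] for p in ("600", "608", "580"))
          - min(pos[p] for p in ("600", "608", "580")))
print(spread)  # effectively zero after 30,000 iterations
```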

 

P.2: A report is generated using the atoms to retrieve all intrusion-level events on April 15th that have the atom as the d_port.  Clearly we will get things outside of a “global incident event” if the temporal window and IDS parameters are not narrowly focused.  (But perhaps the SLIP technology can help us define this window and filter.) 

 

P.3: But the most important specific relationships we can clearly identify (even automatically) are those pairs of ports sharing a common source IP.  The full set of 966 RealSecure events is ordered by source IP. In Appendix A, we view only the d_port, source IP, and event name. 

 

P.4: In Appendix B the list is reduced to unique values of the concatenation of d_port and source IP.  This produces 202 categories where the d_port and source IP pair is unique (occurs only once).

 

The assumption here is that the first event name will characterize the part of the global event that involves this d_port and this source IP.  

 

P.5: In Appendix C, we make a further reduction by looking only at the first occurrence of each event name. We order the data in Appendix B to get the unique event names and the first combined d_port + source IP.
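The two reductions in P.4 and P.5 are both "keep the first record seen per key" operations, which can be sketched like this (the sample events are invented for illustration):

```python
def unique_first(records, key):
    """Keep only the first record seen for each key value, preserving
    order -- the reduction used for both Appendix B (key = d_port +
    source IP) and Appendix C (key = event name)."""
    seen, out = set(), []
    for rec in records:
        k = key(rec)
        if k not in seen:
            seen.add(k)
            out.append(rec)
    return out

# Toy events: (d_port, source IP, event name).
events = [
    ("600", "10.0.0.1", "Port_Scan"),
    ("600", "10.0.0.1", "Port_600_Dest"),
    ("608", "10.0.0.1", "Port_Scan"),
]
appendix_b = unique_first(events, key=lambda r: (r[0], r[1]))
appendix_c = unique_first(appendix_b, key=lambda r: r[2])
print(len(appendix_b), len(appendix_c))  # 2 1
```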